home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
ftp.cs.arizona.edu
/
ftp.cs.arizona.edu.tar
/
ftp.cs.arizona.edu
/
icon
/
newsgrp
/
group99a.txt
/
000124_icon-group-sender _Fri Jun 4 14:45:35 1999.msg
< prev
next >
Wrap
Internet Message Format
|
2000-09-20
|
5KB
Return-Path: <icon-group-sender>
Received: (from root@localhost)
by baskerville.CS.Arizona.EDU (8.9.1a/8.9.1) id OAA03881
for icon-group-addresses; Fri, 4 Jun 1999 14:41:55 -0700 (MST)
Message-Id: <199906042141.OAA03881@baskerville.CS.Arizona.EDU>
Delivered-To: icon-group@silliac.cs.arizona.edu
Date: Fri, 04 Jun 1999 09:09:29 -0700
From: Steve Wampler <swampler@noao.edu>
X-Accept-Language: en
To: icon-group@optima.CS.Arizona.EDU
Subject: Re: Find the longest matching substrings
Errors-To: icon-group-errors@optima.CS.Arizona.EDU
Status: RO
This is a multi-part message in MIME format.
--------------09EB4280837648B91A00FBDF
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
gep2@terabites.com wrote:
>
> > I am working on a project that will mark the inserted or deleted phases
> between two strings. I am wondering whether you know any library
> procedure that will anchor ( or report ) the longest phases that exists
> in both strings.
>
> > For example, assume that we have two strings:
> old:="other strings ab cd e and more substrings"
> new:="some more substrings ab cd ab cd e"
>
> > This function should return phases and position in both strings, i.e.
> [15, 28, "ab cd e"] for "ab cd e" is the longest phase that matched (
> it has three continuous matching elements ).
>
> If I were going to try to make something like this FAST (and especially if one
> of the strings were relatively constant) I'd probably want to use a hash (or
> even direct index?) table pre-computed with the offsets of character substrings
> (I'd probably use two-character adjacent pairs) in one of those strings; this
> would enable stepping rapidly through the second string and quickly finding at
> least (only just the) plausible candidates for launching more detailed substring
> compare (and length computation) operations. This kind of thing would likely be
> storage-hungry, but that's less of a concern today than it once was.
Taking Gordon's comment on speed to heart, here's an adaptation of the
previous solution that is faster (uses simple heuristics to avoid
looking at phrases that can't possibly be longer than ones that are
found already). Note that it can probably be sped up even more by
improving
these heuristics a bit.
This version is about 2.5x faster on my test case (the "that that is..."
nonsense). Applying this adaptation to my solution that ignores
punctuation
when comparing phrases results in a speedup of about 25x! (Because the
test case strings match each other when ignoring punctuation.)
Oh, I also modified the test driver to do time comparisions.
Incidently, it would be 'better' if the phrase element procedures were
converted into string scanning procedures. Scanning strings for phrase
elements sounds useful to me and can considered as a generalization
of scanning strings for characters - in fact, the code here is just
an adaptation of code to find the longest common substring.
It would also be better if the code were rewritten to accept different
definitions of a phrase element more easily!
--
Steve Wampler- SOLIS Project, National Solar Observatory
swampler@noao.edu
--------------09EB4280837648B91A00FBDF
Content-Type: text/plain; charset=us-ascii;
name="phrase3.icn"
Content-Disposition: inline;
filename="phrase3.icn"
Content-Transfer-Encoding: 7bit
procedure main(args)
n := args[1]|10
start := &time
every 1 to n do {
result := longPhrase(read(),read())
}
len := &time - start
write("[",result[1],",",result[2],",\"",result[3],"\"]")
write(real(len)/n,"ms per loop for ",n," loops")
end
# Find longest phrase common to s1 and s2. Fails if no common phrase.
#
# Assumes simple definition for a phrase (sequence of letters)
# Assumes punctuation is to be considered when matching phrases
#
procedure longPhrase(s1,s2)
static pChars
initial pChars := &letters # chars legal in phrase element
maxlen := 0
every i := pStart(s1, pChars) do {
every k := rpStop(s1, i, pChars) do {
if j := findPhrase(s1[i:k], s2, pChars) then {
if maxlen <:= numPhrases(s1[i:k], pChars) then {
saveStart1 := i
saveStop1 := k
saveStart2 := j
}
break # no need to find shorter phrases
} # that start here
}
}
return [\saveStart1, saveStart2, s1[saveStart1:saveStop1]]
end
procedure pStart(s, pc) # generate start of all phrases in s
local i
s ? {
while tab(i := upto(pc)) do {
tab(many(pc))
suspend i
}
}
end
procedure rpStop(s, i, pc) # generates end of all phrases in s,
# starting from right of string
suspend *s+2 - pStart(reverse(s[i:0]), pc)
end
procedure numPhrases(s, pc) # Count number of phrases in s
local count
count := 0
every pStart(s,pc) do count +:= 1
return count
end
procedure findPhrase(p, s, pc) # find phrase p in s
local i
s ? {
while tab(i := upto(pc)) do {
if match(p) then suspend i
tab(many(pc))
}
}
end
--------------09EB4280837648B91A00FBDF--